tutorials/036 - Distributing Calls with Glue Interactive Sessions on Ray.ipynb (169 lines of code) (raw):

{ "cells": [ { "cell_type": "markdown", "metadata": { "editable": true, "trusted": true }, "source": [ "[![AWS SDK for pandas](_static/logo.png \"AWS SDK for pandas\")](https://github.com/aws/aws-sdk-pandas)\n", "\n", "# 36 - Distributing Calls on Glue Interactive sessions\n", "\n", "AWS SDK for pandas is pre-loaded into [AWS Glue interactive sessions](https://docs.aws.amazon.com/glue/latest/dg/is-using-ray.html) with Ray kernel, making it by far the easiest way to experiment with the library at scale." ] }, { "cell_type": "markdown", "metadata": { "editable": true, "trusted": true }, "source": [ "In AWS Glue Studio, choose `Notebook` to create an AWS Glue interactive session:\n", "\n", "![](_static/glue_is_create.png)\n", "\n", "Then select `Ray` as the kernel. The IAM role must trust the AWS Glue service principal.\n", "\n", "![](_static/glue_is_setup.png)\n", "\n", "Once the notebook is up and running you can import the library. You can install `awswrangler` and `modin` as additional dependencies." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "tags": [], "trusted": true, "vscode": { "languageId": "python" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Additional python modules to be included:\n", "awswrangler\n", "modin\n" ] } ], "source": [ "%additional_python_modules awswrangler,modin" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [], "trusted": true, "vscode": { "languageId": "python" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::463623607974:role/service-role/AmazonSageMakerServiceCatalogProductsGlueRole\n", "Trying to create a Glue session for the kernel.\n", "Worker Type: Z.2X\n", "Number of Workers: 5\n", "Session ID: 32566e82-34d2-4db7-adac-cbee573e20bf\n", "Job Type: glueray\n", "Applying the following default arguments:\n", "--glue_kernel_version 0.38.1\n", "--enable-glue-datacatalog true\n", "--auto-scaling-ray-min-workers 1\n", "--additional-python-modules awswrangler,modin\n", "Waiting for session 32566e82-34d2-4db7-adac-cbee573e20bf to get into ready status...\n", "Session 32566e82-34d2-4db7-adac-cbee573e20bf has been created.\n" ] } ], "source": [ "import awswrangler as wr" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [], "trusted": true, "vscode": { "languageId": "python" } }, "outputs": [], "source": [ "df = wr.s3.read_parquet(path=\"s3://ursa-labs-taxi-data/2017/\")" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "tags": [], "trusted": true, "vscode": { "languageId": "python" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " vendor_id pickup_at ... improvement_surcharge total_amount\n", "0 1 2017-01-09 11:13:28 ... 0.3 15.300000\n", "1 1 2017-01-09 11:32:27 ... 0.3 7.250000\n", "2 1 2017-01-09 11:38:20 ... 0.3 7.300000\n", "3 1 2017-01-09 11:52:13 ... 0.3 8.500000\n", "4 2 2017-01-01 00:00:00 ... 0.3 52.799999\n", "\n", "[5 rows x 17 columns]\n" ] } ], "source": [ "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "<div class=\"alert alert-block alert-warning\">\n", "To avoid incurring a charge, make sure to delete the Jupyter Notebook when you are done experimenting.\n", "</div>" ] } ], "metadata": { "kernelspec": { "display_name": "Glue PySpark", "language": "python", "name": "glue_pyspark" }, "language_info": { "codemirror_mode": { "name": "python", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "Python_Glue_Session", "pygments_lexer": "python3" } }, "nbformat": 4, "nbformat_minor": 4 }